ABSTRACT

The  problem  of  generating  synthetic  data  for  the  training  and  evaluation  of 
document  analysis  systems  has  been  widely  addressed  in  recent  years.  With  the  increased 
interest  in  processing  multilingual  sources,  however,  there  is  a  tremendous  need  to  be 
able  to  rapidly  generate  data  in  new  languages  and  scripts,  without  the  need  to  develop 
specialized systems. We have developed a system that uses the language support of the
MS Windows operating system, combined with custom print drivers, to render TIFF images
simultaneously with Windows Enhanced Metafile directives. The metafile information is
parsed  to  generate  zone,  line,  word,  and  character  ground  truth  including  location,  font 
information  and  content  in  any  language  supported  by  Windows.  The  resulting  images 
can  be  physically  or  synthetically  degraded  by  our  degradation  modules,  and  used  for 
training  and  evaluating  Optical  Character  Recognition  (OCR)  systems.  Our  document 
image  degradation  methodology  incorporates  several  often-encountered  types  of  noise  at 
the  page  and  pixel  levels.  Examples  of  OCR  evaluation  and  synthetically  degraded 
document images are given to demonstrate the effectiveness of the approach.

Chapter 1 Introduction


During  the  past  several  decades,  document  image  analysis  and  Optical 
Character  Recognition  (OCR)  have  been  widely  successful.  Many  desktop  solutions 
continue  to  appear  and  work  well  for  high  quality  input.  Although  many  commercial 
OCR  products  have  merged  in  the  market,  this  consolidation  is  helping  customers  with 
more  complete  document  image  conversion,  storage  and  retrieval  solutions.  With  help 
from  advanced  image  processing  methods  and  pattern  recognition  techniques,  some  of 
the  OCR  vendors  claim  a  near  100%  accuracy  rate  on  typical  office  documents. 
However,  there  are  still  some  open  problems,  such  as  improving  OCR  accuracy  on 
poor  quality  images  from  devices  such  as  fax,  dot  matrix  and  impact  printers,  and 
photocopiers, or from physically degraded documents. Furthermore, complex layouts,
multiple languages, and combined content, such as handwritten annotations, provide
additional challenges.

In  this  thesis,  we  are  focusing  on  how  to  generate  representative  training  data 
and  how  to  evaluate  systems  in  support  of  these  open  problems.  It  is  well  known  that 
the  accuracy  of  a  recognition  system  depends  not  only  on  the  features  and  classifiers, 
but  also  on  the  size  and  quality  of  training  sets.  Obtaining  a  significant  corpus  of 
document  images  and  symbolic  ground  truth  is  an  important  precursor  for  training  and 
evaluating  document  analysis  systems.  Often,  documents  are  scanned  page  by  page, 
ground  truth  text  is  keyed  character  by  character,  and  bounding  boxes  are  drawn 
manually.  This  process  is  labor-intensive  and  error  prone,  and  becomes  increasingly 
difficult  when  processing  multi-lingual  collections  with  thousands  of  pages.  Native 
speakers  and  a  special  input  environment  are  required  for  non-Latin  ground  truth
generation,  and  such  tools  may  not  be  widely  available. 

To  bypass  manual  keying  in  symbolic  data,  researchers  have  typically  used 
existing  ground  truth  data  sets.  One  widely  adopted  source  data  set  is  the  University 
of  Washington  (UW)  data  set.  The  first  release  has  two  thousand  English  and 
Japanese  technical  document  images,  and  has  been  widely  used  by  OCR  developers. 
However,  if  the  researchers  want  to  develop  or  test  their  recognition  system  on  other 
document  styles  or  in  other  languages,  those  data  sets  cannot  provide  much  help. 
Some  researchers  have  used  the  Bible  because  it  has  an  additional  advantage  of 
appearing  in  multiple  languages  so  that  it  can  serve  as  a  source  of  parallel  text.  It  is 
not  convenient,  however,  if  the  users  need  to  test  the  OCR  systems  on  their  own 
documents,  which  may  have  specific  vocabulary. 

A method using Device Independent (DVI) files and LaTeX typesetting has
been proposed to address this problem [5] by providing a way to generate images from
electronic text. The text is keyed manually in a LaTeX environment and compiled
to generate DVI files. Those DVI files are transformed to TIFF images using
DVI2TIFF. Although this method provides a convenient way to create accurate
ground  truth  files,  the  manual  formatting  is  still  error  prone,  and  may  result  in 
significant  cost  for  multilingual  documents.  Furthermore,  a  native  speaker  is  still 
needed  to  key  in  the  characters  if  we  would  like  to  process  non-Latin  languages. 

The  second  issue  we  are  interested  in  is  how  to  evaluate  OCR  in  an  unbiased 
way.  Because  the  OCR  systems  are  evaluated  on  different  data  sets,  a  99%  accuracy 
level  of  one  OCR  system  is  different  from  a  99%  accuracy  level  of  another  OCR 
system.  Furthermore,  those  accuracy  rates  are  suspect  when  the  data  sets  used  in
evaluation  are  not  representative  of  the  intended  document  population. 

As  mentioned  above,  the  prohibitive  expense  of  manually  generated  ground 
truth,  and  the  prior  bias  introduced  by  using  those  existing  data  sets  have  prompted  us 
to  use  synthetic  data  as  a  complement  to  real  data.  The  problem  of  generating 
synthetic  data  for  the  training  and  evaluation  of  document  analysis  systems  has  been 
widely  addressed  in  recent  years. 

In  this  thesis,  we  will  describe  a  multilingual  OCR  evaluation  system,  which 
includes  a  document  formatter,  a  ground  truth  generator  (GTG),  an  integrated 
evaluation  tool,  and  a  document  image  degradation  tool.  This  system  provides  a 
universal  framework  to  generate  training  and  evaluation  data  sets  on  a  large  scale. 
Beginning  with  electronic  text,  our  ground  truth  generator  produces  noise-free  images 
and  ground  truth  files.  Since  the  text  can  be  effortlessly  copied  from  the  Internet  or 
existing  electronic  sources  instead  of  being  manually  keyed  in,  this  method  is 
extremely  helpful  when  dealing  with  new  languages  and  new  scripts.  In  most  cases,  a 
person  who  wants  to  create  the  data  sets  can  do  so  without  being  a  native  speaker  of 
that  language. 

1.1  Scope  of  thesis 

In  this  thesis,  we  address  some  aspects  of  the  document  analysis  system’s 
training  and  evaluation,  and  document  image  degradation,  with  a  focus  on  ground 
truth  generation  and  degradation. 

The  complete  textual  ground  truth  for  a  document  image  includes  symbolic 
text  files,  font,  character  size,  and  position  information  for  each  symbol,  as  well  as  the 
location of regions containing graphics, logos, etc. In our evaluation system, noise-free
images and the physically or synthetically degraded images are fed to underlying OCR
systems to generate recognized text. A set of document image degradation methods
has been proposed and implemented to generate synthetic degradations, including
page level and pixel level noise.

The  following  summarizes  the  key  contributions  of  this  thesis: 

•  An automatic multilingual OCR evaluation system has been proposed and
implemented. This system includes a document formatter, a ground truth
generator, a font parser and verifier, and evaluation sub-systems.

•  A  method  to  align  the  ground  truth  files  with  degraded  images  has  been 
proposed and implemented. This method uses a linear transformation to model
the print-copy-fax-scan procedure.

•  A  document  degradation  methodology  has  been  proposed  and  implemented. 
Methods  include  blur,  speckle,  rotation,  jitter,  resolution  change,  pixel  drift, 
horizontal  and  vertical  lines,  and  page  show-through. 

1.2  Organization  of  thesis 

This  thesis  is  organized  into  six  chapters.  In  Chapter  2  we  survey  related  work 
in  the  areas  of  ground  truth  generation,  OCR  evaluation,  and  document  image 
degradation.  We  present  our  multi-lingual  OCR  evaluation  system  in  Chapter  3, 
where the system architecture, the Enhanced Metafile (EMF) structure, our font
parser  tools  used  to  create  and  verify  the  font  mapping  files,  and  the  evaluation  tools 
are  explained.  Two  Chinese  OCR  systems  are  evaluated  to  illustrate  the  system.  In 
Chapter  4,  a  method  to  align  noise  free  ground  truth  with  degraded  images  is 
proposed, and experiments on faxed and camera-captured images show the
effectiveness of this transformation. A document image degradation methodology is
proposed  in  Chapter  5,  where  page  level  and  pixel  level  degradation  methods  are 
explained  in  detail.  Chapter  6  contains  a  summary  of  the  accomplishments. 

Chapter 2 Literature review


We focus our literature review on the areas of using synthetic data sets,
generating  ground  truth,  degrading  document  images,  and  OCR  evaluation. 

2.1 Synthetic data sets and ground truth generation

Using  synthetic  data,  which  is  “born  digital”  and/or  synthetically  degraded, 
has  many  advantages  over  scanning  and  manual  entry,  including  rapid  generation  of 
datasets  at  lower  cost,  continuous  control  of  degradation  level,  and  convenient  testing 
of  the  same  underlying  document  content  with  different  corruption  methods  [1]. 
Although many have argued that synthetic data sets do not provide a representative
corpus, if used correctly, they can provide a valuable complement to expensive
hand-created datasets. Our experiments show that there is often no significant difference
between synthetically generated data and physically generated data in terms of OCR
performance. For instance, OCR achieves a 96.67% accuracy rate on synthetic data
and 96.25% on the same document physically scanned at 300 dpi. Inspired by
the method described in [2] to validate the defect model, we can safely conjecture that
the synthetic data is validated if the OCR errors obtained are indistinguishable from
the errors obtained when using real scanned data. In general this has proven to be an
elusive goal, so we will provide no quantitative validation. To validate the local
degradation models, the author in [5] proposed a statistical methodology based on a
nonparametric, two-sample permutation test, and used a power function to choose
algorithm variables.

The impact of image quality and the representativeness of training image data
sets on OCR performance were originally addressed by Baird in [3]. He claims that the
accuracy of a recognition system depends not only on the features and classifiers, but
also  on  the  size  and  the  quality  of  training  sets.  Using  synthetic  data  in  an  appropriate 
way  may  help  determine  the  weaknesses  of  the  underlying  OCR  and  document 
analysis  systems. 

Typically,  the  ground  truth  data  sets  are  created  manually.  Documents  are 
scanned  page  by  page,  ground  truth  text  is  keyed  in  character  by  character,  and 
bounding  boxes  are  drawn  on  the  images  manually.  Because  a  large  quantity  of 
ground  truth  data  is  required  in  order  to  give  an  accurate  measurement  of  the 
performance  of  document  analysis  and  recognition  systems,  researchers  have  created 
some  data  sets  for  training  and  evaluation,  such  as  the  University  of  Washington 
Document  Image  Database  [4].  This  data  set  has  thousands  of  English  technical 
document  images,  and  corresponding  ground  truth  files,  including  zone  and  page 
bounding  boxes,  attributes,  and  ASCII  text  for  each  constituent  document.  It 
provides  a  valuable  platform  to  develop  and  evaluate  underlying  systems.  However, 
this  data  set  is  not  helpful  if  the  target  documents  are  in  other  languages,  other 
document  styles,  or  have  different  quality  levels.  In  those  situations,  the  researchers 
have  to  create  their  own  ground  truth  data  sets. 

To  obtain  ground  truth  datasets  at  minimum  cost,  automated  ground  truth 
generation  methods  have  been  proposed.  In  [5],  the  author  presents  an  approach  to 
obtain ground truth files. First, the document characters are formatted in LaTeX,
either by manual transcription or by reformatting e-text. The typesetting files are then
compiled to device independent (DVI) files. Ground truth can be extracted from
those typesetting files, while the noise-free document images can be obtained by
using DVI2TIFF. The requirement of DVI files and LaTeX typesetting, however,
limits the practical application in many cases, as LaTeX does not support all
languages. Furthermore, the manual entry is still error-prone and may be
prohibitively  expensive  when  processing  multi-lingual  documents  with  thousands  of 
pages. 

To  overcome  the  inconvenience  of  keying  in  symbolic  data,  researchers  have 
also  used  sources  in  which  both  hard  copy  and  electronic  form  already  exist.  The  use 
of  the  Bible  is  proposed  in  [6]  because  the  electronic  symbolic  ground  truth  exists  in 
many  of  the  world’s  languages,  and  thus  can  serve  as  a  source  of  parallel  text.  In  [6], 
groundtruth files for the Arabic, English, and French Bibles were collected and converted into
DVI files from ASCII text using LaTeX typesetting, and TIFF images were obtained
from  those  DVI  files.  The  Arabic  Bible  was  also  physically  scanned.  This  data  set 
provides  a  broader  platform  for  multi-lingual  OCR  training  and  evaluation,  but  it  is 
not  expandable.  Users  may  want  to  test  the  OCR  systems  in  a  specific  domain,  which 
contains  many  modern  words  not  included  in  the  Bible. 

When  groundtruth  is  generated,  there  are  a  variety  of  options  for 
representation. A complete groundtruth file should include the coordinates of
each character, word, line, and zone when possible. The higher-level
information is critical for tasks such as document segmentation and layout analysis.
After  obtaining  ground  truth  for  ideal  images,  degraded  versions  are  typically 
obtained  by  copying,  faxing,  and/or  rescanning,  but  realigning  the  ground  truth  can  be 
a challenge. In [8], the author aligns the ground truth from ideal images with the
scanned images using a linear transformation matrix. The four outermost points,
which are measured from the four corners of all the bounding boxes of connected
components, are located on both the noise-free image and the corresponding degraded
image. The coordinates of the four feature point pairs are used to calculate the
projective transformation. Then the bounding boxes of the ideal image are mapped to
the degraded image using the computed transformation matrix. A local adjustment is
employed to compensate for nonlinear factors in the print-scan procedure. Because this
approach uses the four outermost bounding box corners as feature points, the procedure is
vulnerable to noise in the images; many more points are used to address this problem
in [9].
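
As an illustration of the four-point alignment described above, the following Python sketch estimates a projective transformation from four point pairs and maps a bounding box through it. It is not the implementation used in [8]; the function names, the (left, top, right, bottom) box convention, and the normalization of the last matrix entry to 1 are assumptions made for this example, and the local adjustment step is omitted.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float]
Box = Tuple[float, float, float, float]  # assumed order: left, top, right, bottom


def solve_linear_system(a: List[List[float]], b: List[float]) -> List[float]:
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    # Augment the matrix so that row operations act on both sides.
    m = [list(row) + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("degenerate point configuration")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def estimate_homography(src: Sequence[Point], dst: Sequence[Point]) -> List[List[float]]:
    """Return a 3x3 matrix H, with H[2][2] = 1, that maps each src point to its dst point."""
    if len(src) != 4 or len(dst) != 4:
        raise ValueError("exactly four point pairs are required")
    rows: List[List[float]] = []
    rhs: List[float] = []
    for (x, y), (u, v) in zip(src, dst):
        # u = (h0 x + h1 y + h2) / (h6 x + h7 y + 1), and likewise for v.
        rows.append([x, y, 1, 0, 0, 0, -u * x, -u * y])
        rhs.append(u)
        rows.append([0, 0, 0, x, y, 1, -v * x, -v * y])
        rhs.append(v)
    h = solve_linear_system(rows, rhs)
    return [h[0:3], h[3:6], [h[6], h[7], 1.0]]


def map_point(h: List[List[float]], p: Point) -> Point:
    """Apply H to a point in homogeneous coordinates and de-homogenize."""
    x, y = p
    w = h[2][0] * x + h[2][1] * y + h[2][2]
    return ((h[0][0] * x + h[0][1] * y + h[0][2]) / w,
            (h[1][0] * x + h[1][1] * y + h[1][2]) / w)


def map_box(h: List[List[float]], box: Box) -> Box:
    """Map the four corners of a box and return the enclosing axis-aligned box."""
    left, top, right, bottom = box
    corners = [map_point(h, c) for c in
               [(left, top), (right, top), (right, bottom), (left, bottom)]]
    xs = [c[0] for c in corners]
    ys = [c[1] for c in corners]
    return (min(xs), min(ys), max(xs), max(ys))
```

With exactly four pairs the system is fully determined, so any error in locating one corner propagates to every mapped box, which is consistent with the sensitivity to noise noted above.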

Several  ground  truthing  tools  have  been  developed  in  order  to  reduce  the  labor 
of  creating  data  sets.  Groundskeeper  [10]  is  a  tool  to  create  and  edit  document 
segmentation  ground-truth.  This  tool  allows  a  user  to  display  a  document  image, 
draw  zones  of  various  types  around  different  page  regions,  and  label  each  zone  with 
attributes such as type, sub-type, parent zone, and attached zones. TrueViz [11] is
a Java program to visualize and edit ground truth or metadata files. This tool provides
text editing, display, and search functions based on Unicode for the image and metadata.
The  results  are  saved  in  XML  format. 

Synthetic  data  sets  have  been  widely  used  recently.  In  [12],  synthetic  data  is 
generated  and  used  for  training  a  Hidden  Markov  Model  (HMM)  based  Arabic  OCR 
system. Symbolic ground truth is keyed in and formatted in a LaTeX environment,
while the noise-free images are obtained from the DVI files. The procedure requires a
native  speaker,  and  a  specific  input  environment  to  key  in  the  non-Latin  documents. 

Additional  synthetic  training  data  is  utilized  in  [13]  to  improve  the 
performance of an HMM-based handwriting recognition system. A perturbation
model, which is based on the summation of a number of CosineWave functions, has
been used to generate synthetic text lines from existing real handwritten lines. Substantial
improvement  was  observed. 

In  [14],  a  line  drawing  degradation  model  was  proposed  for  the  purpose  of 
evaluating  line  detection  algorithms  using  synthetic  data.  This  model  simulates  some 
types  of  noise,  such  as  Gaussian  noise,  blur,  hard  pencil  noise  and  motion  noise, 
introduced during the production and photocopying of technical documents. The
authors use a “Black Box Testing” method to validate the model. They measure the
difference between the real documents and the synthetically generated documents
with noise levels estimated from real images. If the difference is smaller than the
threshold, the model is accepted. However, higher-order statistical analysis is
needed  for  their  validation  method. 

The  work  mentioned  above  suggests  that  if  used  correctly,  the  synthetic  data 
can  provide  a  valuable  complement  to  expensive  manually  created  datasets,  in 
practical  situations. 

2.2  Document  image  degradation  models 

The  study  of  explicit,  quantitative  and  parameterized  models  of  defects 
became  a  focal  point  with  the  work  of  Baird  in  [3],  [15],  [16].  Baird  proposed  in  his 
pioneering  work,  a  parameterized  model  to  approximate  some  aspects  of  the  physics 
of  machine  printing  and  imaging  of  text,  such  as  affine  transform,  threshold,  and
speckle.  This  model  accounts  primarily  for  per-symbol  and  per-pixel  defects.  The 
author  also  applied  bootstrapping  and  power  function  analysis  to  this  physics-based 
model  in  [16].  Using  this  model  and  the  synthetically  generated  character  images,  the 
authors  in  [17]  studied  a  binary  tree  classifier’s  accuracy  as  a  function  of  several 
important  model  parameters.  Those  parameters  include  blur,  binarization  threshold, 
and  the  variance  of  pixel  sensor  sensitivity.  They  found  that  two  defects  (blur  and 
threshold)  affect  the  classification  significantly,  continuously,  and  monotonically. 

As  pointed  out  in  [5],  this  model  mainly  advocates  the  use  of  isolated 
degraded  characters,  and  does  not  reflect  some  important  aspects,  such  as  touching 
characters  and  occurrence  probabilities.  The  authors  then  extended  their  work  and 
proposed  a  document  degradation  model  (DDM)  [18],  which  is  based  on  a  local 
morphological  model,  to  randomly  invert  pixels  and  blur  them  during  the  degradation 
procedure.  The  inverting  probability  is  controlled  by  the  decaying  speed  of  an 
exponential  function,  but  accounts  for  only  the  local  statistical  characteristics. 
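
To make the idea of an exponentially decaying inversion probability concrete, the following Python sketch flips each pixel of a small binary image with a probability that decays with its distance to the nearest pixel of the opposite colour. The specific form a0 * exp(-a * d^2) + eta, the parameter names, and the brute-force distance computation are assumptions made for this illustration; the text above does not specify them, and the blurring step of the model is omitted.

```python
import math
import random
from typing import List

Image = List[List[int]]  # 1 = ink (foreground), 0 = paper (background)


def distance_to_opposite(img: Image, r: int, c: int) -> float:
    """Euclidean distance from pixel (r, c) to the nearest pixel of the other colour.

    Brute force, so only suitable for small illustrative images.
    Returns infinity if the image contains a single colour.
    """
    best = math.inf
    for rr, row in enumerate(img):
        for cc, value in enumerate(row):
            if value != img[r][c]:
                best = min(best, math.hypot(rr - r, cc - c))
    return best


def flip_probability(d: float, a0: float, a: float, eta: float) -> float:
    """Assumed form a0 * exp(-a * d^2) + eta, clipped to at most 1."""
    decay = 0.0 if math.isinf(d) else math.exp(-a * d * d)
    return min(1.0, a0 * decay + eta)


def degrade(img: Image, a0: float, a: float, eta: float, rng: random.Random) -> Image:
    """Invert each pixel independently with its distance-dependent probability."""
    out = [row[:] for row in img]
    for r, row in enumerate(img):
        for c, value in enumerate(row):
            d = distance_to_opposite(img, r, c)
            if rng.random() < flip_probability(d, a0, a, eta):
                out[r][c] = 1 - value
    return out
```

A larger decay parameter `a` confines the inversions to pixels close to character edges, while `eta` adds a constant background rate everywhere; because only the distance to the nearest edge is used, the sketch captures local statistics only.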

Both  models  are  used  widely  in  document  analysis  and  recognition  systems. 
For  example,  to  study  the  effect  of  degraded  images  on  a  font  recognition  system  in 
[19],  three  artificially  generated  images  are  used  to  evaluate  the  system.  To  allow  the 
font recognition from degraded images without any specific training, two
transformation approaches are used: font model transformation and feature value
transformation. Baird’s model was used in [20] to create large-scale degraded image
data  sets  for  document  image  decoding  system  training  and  evaluation.  Their  work 
shows  high  accuracy  from  trained  models  on  even  severely  degraded  images,  and 
significant  improvement  compared  to  untrained  models.  Furthermore,  no  manual
segmentation  is  needed  in  creating  the  training  data  set.  Unfortunately,  neither  of 
these  models  claims  to  handle  “clutter”  noise  that  may  be  present  in  real  documents. 

In [21], the author proposed a method to create a large number of
groundtruthed real images from an existing data set at a fraction of the cost. The
images in an existing data set are printed out, physically degraded, and then
re-scanned. The degradation procedure includes copying, smearing, and adding coffee and
ink stains. After re-scanning, bounding boxes are drawn manually on those
degraded images. The author claims that creating the page and zone box files
costs only 1% of the whole ground truthing procedure. Although this method can bootstrap
existing data sets, it cannot create new data sets. In addition, errors and noise
can be introduced in the manual degradation procedure.

A two-state Markov chain model is proposed in [22]. This method depicts the
document degradation with two states: a random state to model salt and pepper noise,
and a burst state to model blurring over a large document region. The power function
in [5] is used to validate the model, and a genetic algorithm is suggested to estimate
its transition probabilities.
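
The two-state structure can be sketched in a few lines of Python. This is an illustration only, not the model of [22]: the state names, the per-state flip probabilities, and the choice to apply the chain along a single scanline are all assumptions, and the burst state here simply flips pixels at a higher rate rather than modelling blur.

```python
import random
from typing import List, Sequence

RANDOM, BURST = 0, 1  # hypothetical labels for the two states


def simulate_states(n: int, p_random_to_burst: float, p_burst_to_random: float,
                    rng: random.Random, start: int = RANDOM) -> List[int]:
    """Generate a sequence of n states from a two-state Markov chain."""
    states = []
    state = start
    for _ in range(n):
        states.append(state)
        draw = rng.random()
        if state == RANDOM and draw < p_random_to_burst:
            state = BURST
        elif state == BURST and draw < p_burst_to_random:
            state = RANDOM
    return states


def degrade_scanline(pixels: Sequence[int], states: Sequence[int],
                     flip_prob: Sequence[float], rng: random.Random) -> List[int]:
    """Flip each binary pixel with the probability assigned to its state.

    flip_prob[RANDOM] would be small (isolated salt and pepper noise) and
    flip_prob[BURST] larger (a contiguous degraded region).
    """
    return [1 - px if rng.random() < flip_prob[s] else px
            for px, s in zip(pixels, states)]
```

Small transition probabilities produce long runs in each state, so the degraded regions appear as bursts rather than as uniformly scattered noise.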

2.3  OCR  evaluation 

Characterizing a profile of OCR systems provides useful information [23],
such as predicting OCR performance in a larger system and monitoring progress. It is
also very helpful for finding vulnerable parts of the document recognition system under
some circumstances. Thus valuable feedback can be obtained through OCR
evaluation.

As described in [24], there are two types of OCR evaluation: black box
evaluation and white box evaluation. Black box evaluation treats the OCR
system as an indivisible unit, while white box evaluation characterizes the
performance of each submodule of the document recognition system, such as the
preprocessing, segmentation, and classification modules. White box evaluation is
only applicable if the researcher can access the intermediate output of the OCR
software.
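
A black box evaluation needs only the ground truth text and the final OCR output. The Python sketch below computes a character accuracy from the edit distance between the two strings. The metric shown, (n - errors) / n with n the number of ground truth characters, is a common convention and is assumed here for illustration; it is not necessarily the measure used by the evaluation tools described in this thesis.

```python
def edit_distance(truth: str, output: str) -> int:
    """Levenshtein distance: minimum insertions, deletions, and substitutions."""
    previous = list(range(len(output) + 1))
    for i, t in enumerate(truth, start=1):
        current = [i]
        for j, o in enumerate(output, start=1):
            cost = 0 if t == o else 1
            current.append(min(previous[j] + 1,          # delete t
                               current[j - 1] + 1,       # insert o
                               previous[j - 1] + cost))  # match or substitute
        previous = current
    return previous[-1]


def character_accuracy(truth: str, output: str) -> float:
    """Fraction of ground truth characters not accounted for by an OCR error."""
    if not truth:
        raise ValueError("ground truth must not be empty")
    return (len(truth) - edit_distance(truth, output)) / len(truth)
```

For example, if "document" is recognized as "docurnent", the "m" has been replaced by "rn", which counts as two errors out of eight characters. Note that the value can become negative when the output contains many spurious insertions.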

In  the  following  chapters,  we  present  a  methodology  to  evaluate  multi-lingual 
OCR  systems.  Our  method  includes  a  ground  truth  generator  to  create  complete 
ground  truth  files  and  noise  free  images  automatically,  and  a  tool  to  create 
synthetically  degraded  images,  with  both  the  page  level  and  pixel  level  noise. 

Chapter 3 Multi-lingual OCR evaluation


In many situations, it is useful to measure the effect of underlying OCR
software in a cascading system with downstream processes, such as Information
Retrieval (IR) and Machine Translation (MT) systems. As we mentioned before, using
existing data sets can introduce bias, while manual ground truthing can be prohibitively
expensive and becomes extremely difficult when processing multi-lingual collections
with thousands of pages.

In  this  chapter,  we  present  a  methodology  to  generate  noise  free  document 
images  and  symbolic  groundtruth  files  automatically  using  a  custom  print  driver  and 
meta-file  information.  The  system  architecture  is  introduced  and  briefly  discussed  in 
Section 3.1. The main component of our system, the groundtruth generator, is
explained  in  detail  in  Section  3.2.  This  section  also  depicts  the  structure  of  Enhanced 
Meta-File  (EMF)  and  our  font  parser,  which  is  used  to  extract  font-mapping  files 
from  TrueType  Font  (TTF)  files.  In  Section  3.3,  we  describe  the  evaluation  tools. 
We  evaluate  two  major  Chinese  OCR  software  packages  in  Section  3.4. 

3.1  System  overview 

The  architecture  of  our  evaluation  system  is  shown  in  Figure  3.1. 

Beginning with electronic text in a standard encoding, documents are either
manually structured and formatted or passed through an XMF formatter to obtain a
structured document instance. From the structured documents, we generate noise-free
images and ground truth files by using a custom print driver and metafile information
via a parser/renderer. The system relies on the Microsoft Windows operating
system’s use of enhanced metafile directives to provide a unified representation that
includes Unicode glyph information and the physical location of each character on the
page.

Figure 3.1 OCR evaluation system architecture

Degraded images can then be obtained physically by printing, scanning and
faxing, or be obtained synthetically by using degradation methods (Chapter 5).
Finally, those ideal and degraded images are passed through an OCR system for
evaluation. The results can also be used to measure the effect of OCR on
downstream processes, such as information retrieval (IR) and machine translation (MT).

Our method is especially helpful in generating non-Latin ground truth. The
user can copy and paste a document in the target language from a website, then
create data sets using our system.

3.2 Ground truth generator

In  our  system,  the  ground  truth  generator  (GTG)  is  used  to  obtain  the  synthetic 
noise  free  images,  and  parse  the  symbolic  ground  truth  files  from  EMF  files. 

First, the structured documents are fed to the GTG system. Image files at
different resolutions and the metafiles are obtained via a custom printer driver. From
the metafiles, we obtain character codes, font, and layout information for each symbol
rendered.  Because  the  ground  truth  files  are  parsed  from  metafiles,  which  rely  only 
on  the  font  files  installed  on  the  computer,  the  Unicode  and  original  coding 
groundtruth files can be obtained accurately and rapidly. We have tested our system on
dozens of languages, including Arabic, Chinese, Farsi, Hindi, Japanese, Korean, Thai,
and Pashto, and our system provides a universal framework to generate groundtruth
files for multilingual documents.

For debugging purposes, images and layout information are used to create
overlaid  images,  where  the  bounding  boxes  are  displayed  at  the  character,  word,  line, 
and  zone  levels.  Examples  in  different  zoning  levels  are  given  in  Figure  3.2. 

Figure 3.2 Examples of overlaid images: (a) Chinese document image at the character level;
(b) English document image at the word level; (c) Japanese document image at the line level;
(d) Arabic document image at the zone level

Three kinds of ground truth files are generated by GTG: core ground truth
files, raw ground truth files, and structured ground truth files.

•  Core ground truth files contain the position information at the symbol,
line, and zone levels and the identity of each symbol (Figure 3.3).

•  Raw files are in Unicode format or in the original coding format. These files
can be compared with OCR results in evaluation. For example, in Chinese
OCR evaluation, if the OCR output is in GB2312, then the raw ground
truth files in the original encoding should be used in the comparison.

•  Structured ground truth files include HTML files and XML files. As a
debugging tool, the HTML files can be used to check whether the ground
truth file is the same as the original text (Figure 3.4). The XML files
are used for data exchange or storage (Figure 3.5).

Figure 3.3 Core ground truth file example

The first several lines in core ground truth files contain page level
information, including the coordinates of the bounding box for the content, page size,
resolution, and fonts. After this header, we enumerate the ground truth in a tree
structure. Each item of ground truth is listed beginning with its category label, such
as zone, line, or char for a character item. Following the label is the coordinate
information of the bounding box in parentheses, and additional metadata such as a “T”
for a text zone, or an “F” for a figure (or non-text zone). All the child items
belonging to a parent item follow it in reading order. In each word item, we use an
integer number, which corresponds to one of the fonts described above, as the property of
the word item. For each character item, we provide the character in decimal Unicode,
in hex Unicode, and the glyph index as the properties after the coordinate
information.
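
The tree structure described above can be modelled with a few small record types. The Python sketch below is illustrative only: the class and field names, and the (left, top, right, bottom) box order, are hypothetical and do not reproduce the actual file syntax shown in Figure 3.3. It also shows how a word-level bounding box can be derived as the union of its character boxes.

```python
from dataclasses import dataclass, field
from typing import Iterable, List, Tuple

Box = Tuple[int, int, int, int]  # assumed order: left, top, right, bottom


@dataclass
class Char:
    box: Box
    codepoint: int    # the character in decimal Unicode
    glyph_index: int  # index of the glyph in the font

    @property
    def hex_code(self) -> str:
        """The same code point written in hexadecimal."""
        return format(self.codepoint, "04X")


@dataclass
class Word:
    font_id: int  # integer referring to a font listed in the page header
    chars: List[Char] = field(default_factory=list)


def union(boxes: Iterable[Box]) -> Box:
    """Smallest axis-aligned box that encloses all the given boxes."""
    boxes = list(boxes)
    if not boxes:
        raise ValueError("at least one box is required")
    return (min(b[0] for b in boxes), min(b[1] for b in boxes),
            max(b[2] for b in boxes), max(b[3] for b in boxes))


def word_box(word: Word) -> Box:
    return union(c.box for c in word.chars)


def word_text(word: Word) -> str:
    return "".join(chr(c.codepoint) for c in word.chars)
```

The same union operation can be applied again to word boxes to obtain line boxes, and to line boxes to obtain zone boxes.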


Figure 3.4 HTML ground truth file example for Thai

The user can check the correctness of the obtained ground truth by comparing this
HTML file with the original document, assuming the browser correctly renders
Unicode.

The synthetic images are noise-free images at different resolutions. They
can be synthetically degraded by our degradation tool, or physically degraded
by printing, copying, faxing, and scanning, as shown in the next chapter. The
synthetic images and degraded images can be used to evaluate OCR systems.

3.2.1  Obtaining  ground  truth  files  from  the  enhanced  meta  files 

All the ground truth files are parsed from the Enhanced Metafile (EMF),
which consists of a sequence of recorded GDI commands covering all major areas of
GDI functions [25]. EMF is used as a generic graphics data exchange format that
supports all major elements of graphics, including pixels, lines, curves, text, and
bitmaps. Our major goal is to extract all the symbols in Unicode, together with the
position information for each character, from the EMF file. Unicode provides a
unique number for every character, independent of the platform, the program, or the
language [26]. The ground truth in Unicode can be compared with the OCR output,
which should also be in Unicode. If the OCR output is in another encoding, say
GB2312 for Chinese, we need to either translate the OCR output into Unicode or use
the raw files generated by GTG. The position information consists of the
coordinates of the bounding box for each symbol, and can be used to derive the
word, line, or zone positions. This higher-level information is critical in
evaluating document segmentation systems.
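
The encoding translation mentioned above can be sketched with Python's built-in codecs. This is only an illustration of the idea, not the conversion code used in GTG; the function names and the sample bytes are assumptions.

```python
def to_unicode(ocr_bytes, encoding="gb2312"):
    """Decode OCR output bytes in a legacy encoding into a Unicode string."""
    return ocr_bytes.decode(encoding)

def to_original(text, encoding="gb2312"):
    """Encode Unicode ground truth back into the OCR system's encoding."""
    return text.encode(encoding)

# Hypothetical OCR output: the GB2312 byte pairs for U+4E2D and U+6587.
ocr_output = b"\xd6\xd0\xce\xc4"
unicode_text = to_unicode(ocr_output)
code_points = [hex(ord(ch)) for ch in unicode_text]
```

Either direction gives comparable strings; converting to Unicode has the advantage that one comparison routine serves every language.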
We use a custom printer driver to get the binary EMF files from structured
text documents. Figure 3.5 shows example records in an EMF file.

HEADER
    rclBounds = (359, 281, 1311, 383)
    rclFrame = (0, 0, 21300, 27300)
    dSignature = 0x464D4520
    nVersion = 0x00010000
    nBytes = 33860
    nRecords = 98
    nHandles = 6
    sReserved = 0
    offDescription = 108
    nPalEntries = 0
    szlDevice = (2520, 3220)
    szlMillimeters = (213, 273)
    cbPixelFormat = 0
    offPixelFormat = 0
    bOpenGL = 0

EXTCREATEFONTINDIRECTW
    ihFont = 1
    elfw.elfCulture = 522
    elfw.elfFullName =
    elfw.elfLogFont.lfCharSet = 0
    elfw.elfLogFont.lfClipPrecision = 64
    elfw.elfLogFont.lfEscapement = 0
    elfw.elfLogFont.lfFaceName = Arial
    elfw.elfLogFont.lfHeight = -42
    elfw.elfLogFont.lfItalic = 0
    elfw.elfLogFont.lfOrientation = 0
    elfw.elfLogFont.lfOutPrecision = 4
    elfw.elfLogFont.lfPitchAndFamily = 34
    elfw.elfLogFont.lfQuality = 0
    elfw.elfLogFont.lfStrikeOut = 0
    elfw.elfLogFont.lfUnderline = 0
    elfw.elfLogFont.lfWeight = 400
    elfw.elfLogFont.lfWidth = 0
    elfw.elfMatch = 1228892

Figure 3.5 EMF file example

A  parser  written  in  VC++  is  used  to  extract  the  information  from  EMF 
records.  Following  are  the  corresponding  records  used  in  our  parser: 

•  HEADER 

From  this  record,  we  can  obtain  the  page  size,  image  resolution,  and  the  content 
rectangle  in  mm  units. 

•  EXTCREATEFONTINDIRECTW 
This record provides the font information and font properties, such as font face
name, character set, font height, and whether it is italic or underlined. This
record is essential when the EXTTEXTOUTW records contain glyph indices instead of
Unicode code points. Once we know the font, we can retrieve the corresponding
Unicode from the glyph indices via the mapping file. The character set
information helps us to obtain the original code points in encodings other than
Unicode. For example, although most Chinese documents are coded in Unicode, we may
still need GB2312 code points for evaluating OCR software that only outputs
GB2312 code points.

•  EXTTEXTOUTW 

With this record, we obtain the coordinates of the bounding box for the string,
the code point of each character, and the offset of each character. By checking
the corresponding bit of the “emrtext.fOptions” value, we can determine whether
the codes that follow are Unicode code points or glyph indices.

If the codes are in Unicode, we are almost done. If they are glyph indices,
we need to map them to the corresponding Unicode, which is covered in detail in
the next section.
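
Two of the checks described above can be sketched in a few lines. The first derives the image resolution from the `szlDevice` (pixels) and `szlMillimeters` (millimetres) header fields, using the values in Figure 3.5; the second tests the glyph-index bit of `fOptions`. The constant `ETO_GLYPH_INDEX = 0x0010` is the flag value defined in the Windows GDI headers; the function names are hypothetical and this is not the VC++ parser itself.

```python
MM_PER_INCH = 25.4
ETO_GLYPH_INDEX = 0x0010  # GDI flag: text codes are glyph indices, not Unicode

def resolution_dpi(szl_device, szl_millimeters):
    """Resolution in dots per inch from the reference device size fields."""
    (px_w, px_h), (mm_w, mm_h) = szl_device, szl_millimeters
    return (px_w / mm_w * MM_PER_INCH, px_h / mm_h * MM_PER_INCH)

def codes_are_glyph_indices(f_options):
    """True when an EXTTEXTOUTW record carries glyph indices."""
    return bool(f_options & ETO_GLYPH_INDEX)

# Header values taken from Figure 3.5.
dpi_x, dpi_y = resolution_dpi((2520, 3220), (213, 273))
```

With the header values of Figure 3.5, both results come out at roughly 300 dpi.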

3.2.2  Font  mapping  files  and  parser  tools 

Glyph indices are defined in the font files, and tell the operating
system how to draw characters on screen, or the printer how to print them. Each
font has its own structure and its own definition of glyph indices.

Even for the same character in one language, different font files can have
different glyph indices and different glyph images. Table 1 shows the glyph
indices and Unicode sequences that different fonts use for the same characters in
Urdu. From this table, we can see that TITUS Cyberbit Basic treats both characters
in isolated form, which is not useful for Urdu. Arial Unicode MS treats them
partially correctly, but does not provide the right ligature. Only Urdu Naskh
Asiatype produces the correct ligature in this case.


Character        Font                    Glyph Index     Unicode
(glyph image)    Urdu Naskh Asiatype     302             0x06D2 0x0646
(glyph image)    Arial Unicode MS        1342 50860      0x06D2 0xFEE7
(glyph image)    TITUS Cyberbit Basic    2460 2289       0x06D2 0x0646

Table 1 Example of same character in different fonts

As far as we know, characters and glyphs have three mapping relationships:


1. One to one mapping. One character is represented by a single glyph, and one
glyph represents a single character. This is common in languages with large
character sets, such as Chinese and Korean. In this case, we can easily retrieve
the Unicode from the glyph index.

2. One to many mapping. In this case, a character may be represented by a
combination of several glyphs, or one character may have more than one
presentation form. An example of the former case is shown in Figure 3.6, where a
character in a Hindi document is composed of glyph 128 and glyph 87. Glyph 87 can
be used in other glyph composites as well. For this case, we must retrieve all
components of the character, and find all corresponding Unicode code points.

Figure 3.6 Composition of one character from several glyph indices

The case of multiple forms for one character is common in Arabic documents,
where contextual glyph forms are heavily used. There are four forms of the same
character (initial, middle, final, and isolated), depending on the context. For
example, the character “ha” in Arabic can be represented in four forms, as shown
in Figure 3.7.

Figure 3.7 Arabic character “ha” in isolated, initial, middle, and final form

3. No explicit mapping from glyph index to Unicode. For example, Devanagari, the
script used for Hindi in India, has only 128 code points, from 0x0900 to 0x097F in
the Unicode table, but has many different forms. Using 0x094D, the VIRAMA sign in
Hindi, we can get the half consonants from the full consonant. Using 0x200C and
0x200D, we can get the conjuncts of consonants in different formats. In Hindi
fonts, many composites are also used. Some of the composites have a corresponding
Unicode sequence, but not all of them. For those that do not, we need to provide
the Unicode manually.
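
The relationship between contextual presentation forms and base characters can be illustrated with the Unicode database that ships with Python. This is a sketch of the underlying Unicode facts only; it is not part of the GTG tool chain, and the helper name is hypothetical. Compatibility normalization (NFKC) folds each Arabic presentation form back to its base code point, which is why 0xFEE7 and 0x0646 in Table 1 denote the same letter.

```python
import unicodedata

def base_character(presentation_form):
    """Fold a contextual presentation form back to its base code point."""
    return unicodedata.normalize("NFKC", presentation_form)

# The four presentation forms of Arabic HEH: isolated, final, initial, medial.
ha_forms = ["\uFEE9", "\uFEEA", "\uFEEB", "\uFEEC"]
ha_bases = {base_character(form) for form in ha_forms}

# A Devanagari half consonant is requested as consonant + VIRAMA + ZERO WIDTH JOINER.
half_ka = "\u0915\u094D\u200D"
virama_name = unicodedata.name(half_ka[1])
```

No comparable folding exists for font-specific Devanagari composites, which is why those glyphs need the manual mapping described above.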

We use a program called “ttfdump” to access the internal components of the
font file. A typical dumped font file is shown in Figure 3.8.

In a font file, the “cmap” table contains the mapping between characters and
glyph indices. Usually, there is more than one sub-table in the “cmap” section.
The sub-table with “Platform ID 3” and “Specific ID 1” is for Windows Unicode.
Other sub-tables are used for other operating systems and other coding systems.
We used the Windows Unicode sub-table in our ground truth generator system. For
detailed information about the font file structure, please refer to [27].
In this sub-table, we can find the mapping from each Unicode character to its
glyph index. For most of the language fonts tested, this gives a complete glyph
index to Unicode mapping. For other languages, such as Arabic and Hindi, we need
to get more information from other parts of the font file.
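
Because the “cmap” sub-table maps characters to glyph indices, the ground truth generator needs the reverse direction. A minimal sketch of that inversion is shown below, assuming the sub-table has already been read into a dictionary; the function names and the sample entries are hypothetical.

```python
from collections import defaultdict

def invert_cmap(char_to_glyph):
    """Turn {code point: glyph index} into {glyph index: [code points]}."""
    glyph_to_chars = defaultdict(list)
    for code_point, glyph_index in sorted(char_to_glyph.items()):
        glyph_to_chars[glyph_index].append(code_point)
    return dict(glyph_to_chars)

def unmapped_glyphs(glyph_to_chars, num_glyphs):
    """Glyph indices that no character maps to (composites, ligatures, ...)."""
    return [g for g in range(num_glyphs) if g not in glyph_to_chars]

# Hypothetical sub-table: two characters share glyph 3, glyphs 0-2 and 4 are unmapped.
cmap = {0x0041: 36, 0x0042: 37, 0x0020: 3, 0x00A0: 3}
reverse = invert_cmap(cmap)
missing = unmapped_glyphs(reverse, 5)
```

The list of unmapped glyphs is exactly the set that has to be resolved from other parts of the font file or by hand.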


; TrueType v1.0 Dump Program - v1.63, Apr 11 1996, rrt, dra, gch, ddb, lcp
; Copyright (C) 1991 ZSoft Corporation. All rights reserved.
; Portions Copyright (C) 1991-1995 Microsoft Corporation. All rights reserved.
; Dumping file 'C:\gangzi\project\font\program\font_table_netj\font_table_c\test\wingding.ttf'

Offset Table
  sfnt version:   1.0
  numTables =     18
  searchRange =   256
  entrySelector = 4
  rangeShift =    32

   0. 'DSIG' - chksm = 0x4A66FCEE, off = 0x00012858, len = 5136
   1. 'LTSH' - chksm = 0x3939391A, off = 0x00001E18, len = 230
   2. 'OS/2' - chksm = 0x31C6E48A, off = 0x000001A8, len = 36
   3. 'VDMX' - chksm = 0xF374DAB3, off = 0x00001F00, len = 3004
   4. 'cmap' - chksm = 0x3BEFC73C, off = 0x00001950, len = 768
   5. 'cvt ' - chksm = 0xAE2FA9A9, off = 0x00003270, len = 1338
   6. 'fpgm' - chksm = 0xC4F43BB0, off = 0x00002E10, len = 1119
   7. 'gasp' - chksm = 0x0823000A, off = 0x00000200, len = 20
   8. 'glyf' - chksm = 0xFEF43833, off = 0x0000509C, len = 52534
   9. 'hdmx' - chksm = 0xEE699DA1, off = 0x00003B34, len = 5480
  10. 'head' - chksm = 0xC9876654, off = 0x0000012C, len = 54
  11. 'hhea' - chksm = 0x12130A8E, off = 0x00000164, len = 36
  12. 'hmtx' - chksm = 0x49428A89, off = 0x000037AC, len = 904
  13. 'loca' - chksm = 0x25BAEF7C, off = 0x00001C50, len = 454
  14. 'maxp' - chksm = 0x0356062A, off = 0x00000188, len = 32
  15. 'name' - chksm = 0x81A52B9D, off = 0x00000214, len = 5947
  16. 'post' - chksm = 0x596C3A57, off = 0x00011DD4, len = 2692
  17. 'prep' - chksm = 0x4FC7275F, off = 0x00002ABC, len = 852

'cmap' Table - Character To Index Map
  Size = 768 bytes
  'cmap' version: 0
  numTables: 2

Figure 3.8 Example of a dumped font file

We  have  a  font  parser  to  automatically  generate  the  font  mapping  files. 
Following  is  the  pseudo  code  we  used  in  font  parser  program: 


while (not end of font file)
{
    switch (glyph_index_type)
    {
    case one_to_one:
        add_oneNode(glyphIndex, Unicode);
        break;

    case composite:
        Unicode = find_composite(glyphIndex);
        if (Unicode == NULL)
        {
            /* no Unicode candidate: log it and save the glyph image */
            record_error(glyphIndex);
            draw_glyph(glyphIndex);
        }
        else
        {
            add_oneNode(glyphIndex, Unicode);
        }
        break;
    }
}
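
For readers who prefer something executable, the loop above can be sketched in Python as follows. The data structures, the function signature, and the composite entry for glyph 200 are assumptions made for illustration; glyph 102 and glyph 110 follow the verification file example for the Mangal font. The real parser reads these values from the font file.

```python
def build_font_mapping(glyphs, cmap_lookup, find_composite):
    """Return (glyph index -> code points, list of glyphs needing manual mapping)."""
    mapping, unresolved = {}, []
    for glyph_index, kind in glyphs:
        if kind == "one_to_one":
            mapping[glyph_index] = [cmap_lookup[glyph_index]]
        elif kind == "composite":
            code_points = find_composite(glyph_index)
            if code_points is None:
                unresolved.append(glyph_index)  # flagged as NULL for manual patching
            else:
                mapping[glyph_index] = code_points
    return mapping, unresolved

# Hypothetical inputs.
glyphs = [(102, "one_to_one"), (110, "composite"), (200, "composite")]
cmap_lookup = {102: 0x090F}
composites = {200: [0x0915, 0x094D]}  # assumed composite: consonant + VIRAMA

mapping, unresolved = build_font_mapping(glyphs, cmap_lookup, composites.get)
```

Returning the unresolved indices separately keeps the manual patching step explicit instead of silently dropping glyphs.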


Using our font parser tool, two output files and one directory are obtained for
each font file: the font mapping file, the font verification file, and the glyph
images directory. The font mapping file is used by the ground truth generator to
retrieve the Unicode from the glyph index if necessary. The font verification file
is written as an HTML file, and is used to check the correctness of the font
mapping file. Because the font for each cell in the HTML file can be specified
individually, we can compare the glyph image, which is extracted directly from the
font file, with the image generated from the Unicode under that font. Figure 3.9
is an example of a font verification file for the Devanagari script.

The first column lists the glyph indices in ascending order, while the
second column gives the Unicode candidate in hex format, if available. If we could
not find a Unicode candidate for a glyph in the font file, a “NULL” label is put
there to remind us to map this index to Unicode manually later. The real glyph
image for that index is put in the third column. The fourth column is the image
generated from the Unicode candidate under the specific font, Mangal in this
example, which is a Unicode font for the Devanagari script.

Once  all  the  “NULL”  indices  have  been  mapped  manually,  this  font 
verification  file  can  be  generated  again  to  check  the  correctness  of  manual  patches. 
[Screen shot: browser window titled “Verification File for Mangal Regular”. The
glyph image columns are not reproduced; the recoverable columns are listed below.]

Glyph index    Unicode candidate
102            0x090F
103            0x0910
104            0x0911
105            0x0912
106            0x0913
107            0x0914
108            0x0960
109            0x0961
110            NULL

Figure 3.9 Example of font verification file for Mangal font


Figure 3.10 Example of font mapping file for Mangal font

Figure 3.10 shows an example of a finished font mapping file, from which we
can see that glyph index 110 has been mapped to Unicode.

Once we have decoded all the font information and have the glyph indices from
the EMF file, we can obtain the ground truth file in Unicode and in the original
code page by mapping the glyph indices to Unicode.

3.3  Evaluation  tools 

The algorithm in [28] is employed to evaluate the performance of the underlying
OCR systems. Three kinds of errors are defined according to the three types of
edit operations on the string: deletion, insertion, and substitution. For example:

Ground truth:  c o m p a r i s o n
OCR output:    c   m t a r i s o n k j
Operations:    - I - S - - - - - - D D

where I = insertion, S = substitution, D = deletion, and - = correct recognition.
The accuracy in this example is 80%: 8 of the 10 ground truth characters are
recognized correctly.
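
The example can be reproduced with a standard edit-distance alignment. The sketch below is a plain dynamic-programming implementation written for illustration; it is not the algorithm of [28] or the ported UNIX code, and the accuracy formula (correctly recognized characters divided by ground truth length) is inferred from the 80% figure above.

```python
def align(truth, ocr):
    """Return one operation per alignment column: '-', 'S', 'I' or 'D'."""
    n, m = len(truth), len(ocr)
    # dist[i][j] = minimum number of edits turning ocr[:j] into truth[:i]
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dist[i][0] = i
    for j in range(1, m + 1):
        dist[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if truth[i - 1] == ocr[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j - 1] + cost,  # match or substitution
                             dist[i - 1][j] + 1,         # char must be inserted
                             dist[i][j - 1] + 1)         # extra char to delete
    ops, i, j = [], n, m
    while i > 0 or j > 0:  # walk back from the end along one optimal path
        if i > 0 and j > 0:
            cost = 0 if truth[i - 1] == ocr[j - 1] else 1
            if dist[i][j] == dist[i - 1][j - 1] + cost:
                ops.append("-" if cost == 0 else "S")
                i, j = i - 1, j - 1
                continue
        if i > 0 and dist[i][j] == dist[i - 1][j] + 1:
            ops.append("I")
            i -= 1
        else:
            ops.append("D")
            j -= 1
    return "".join(reversed(ops))

def accuracy(truth, ocr):
    """Fraction of ground truth characters that were recognized correctly."""
    return align(truth, ocr).count("-") / len(truth)

operations = align("comparison", "cmtarisonkj")
rate = accuracy("comparison", "cmtarisonkj")
```

Other accuracy definitions subtract all errors, including extra characters, from the ground truth length, so reported rates are only comparable when the same formula is used throughout.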

We have ported the code from UNIX [29] and integrated it into our
evaluation system. Character-level and word-level accuracy reports are calculated
page by page. A summary is reported for the whole document and for a collection,
and accuracy confidence intervals are computed from the accuracy rates.

For visualization, we use a scatter plot to compare two OCR runs. A diagram
of sorted accuracy reports for different OCR runs is used for the first time in
our evaluation to show the recognition pattern of the systems. Knowing which
patterns are the most vulnerable in recognition will help improve the classifier.

Figure 3.11 is a screen shot of our evaluation tool interface. The user can
choose the underlying OCR output files and the ground truth files from this
interface. Depending on the amount of OCRed data, our system can generate OCR
profiles from structured folders, two directories, or two pages. Using a
structured folder, we can process thousands of OCRed pages. Using two directories,
we process the OCRed files and the ground truth files in two separate directories.

For  the  user’s  convenience,  the  OCRed  file  name  need  not  be  the  same  as  the 
ground  truth  file.  Thus  a  different  prefix  or  suffix  can  be  used  to  distinguish  different 
OCR  software  or  different  degradation  levels,  methods  etc.  To  align  the  OCRed  file 
with  corresponding  ground  truth  files,  users  can  specify  the  alignment  file,  use  the  file 
name,  or  even  let  them  be  processed  alphabetically. 
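
The name-based and alphabetical pairing options could look roughly like the sketch below. The function names, the prefix/suffix parameters, and the file names are hypothetical; the actual evaluation tool may resolve pairs differently.

```python
def pair_by_name(ocr_names, truth_names, prefix="", suffix=""):
    """Pair OCR files with ground truth files after stripping a prefix/suffix."""
    def stem(name):
        return name.rsplit(".", 1)[0]

    def key(name):
        s = stem(name)
        if prefix and s.startswith(prefix):
            s = s[len(prefix):]
        if suffix and s.endswith(suffix):
            s = s[:-len(suffix)]
        return s

    truth_by_stem = {stem(t): t for t in truth_names}
    return [(o, truth_by_stem[key(o)]) for o in ocr_names if key(o) in truth_by_stem]

def pair_alphabetically(ocr_names, truth_names):
    """Fallback: pair the two sorted lists position by position."""
    return list(zip(sorted(ocr_names), sorted(truth_names)))

truth_files = ["page01.txt", "page02.txt"]
ocr_files = ["ocr1_page02_300dpi.txt", "ocr1_page01_300dpi.txt"]
pairs = pair_by_name(ocr_files, truth_files, prefix="ocr1_", suffix="_300dpi")
fallback = pair_alphabetically(ocr_files, truth_files)
```

Alphabetical pairing is only safe when both directories contain the same pages under consistently ordered names, so name-based pairing or an explicit alignment file is the more robust choice.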


Figure  3.11  Screen  shot  of  evaluation  system  interface 

3.4  Evaluation  examples 
Two primary Chinese OCR systems, OCR1 and OCR2, are tested in our
evaluation experiments. We adopt the black box method here.

We used the available United Nations’ Anti-Chemical Weapon Treaty and
other documents in our evaluation. This treaty is available in six languages:
Arabic, Chinese, English, French, Russian, and Spanish.

The performance of the OCR systems is evaluated as follows. First, the
ideal images and ground truth files are generated by GTG from electronic text.
Second, the noise-free images and degraded images, including the physically
scanned images and synthetically degraded images, are input to the OCR systems. To
get the scanned images, we printed out the document and scanned it back at 200 dpi,
300 dpi, and 400 dpi.

Then  the  OCR  results  from  noise-free  images  and  scanned  images  are 
compared  with  the  ground  truth  files.  Since  the  output  of  both  systems  is  in  GB2312, 
the  raw  files  in  original  encoding  are  used.  Finally,  the  accuracy  rate  is  calculated 
page  by  page,  and  the  overall  evaluation  results  are  obtained  from  those  rates. 

Figure 3.12 shows the accuracy scatter diagram of the two OCR systems using
the 100 dpi and 300 dpi noise-free images. The horizontal axis represents the
accuracy rate from one of the OCR runs, while the vertical axis represents the
accuracy rate from the other OCR run. Each circle in this diagram represents the
accuracy rate pair of the two OCR runs for one page in the data set. The
x-coordinate corresponds to the OCR2 accuracy for that image and the y-coordinate
corresponds to the OCR1 accuracy. For example, if OCR2 obtained a 49.46% accuracy
rate while OCR1 obtained 92.43% for one page, then the coordinates of the circle
are (49.46, 92.43) in the diagram. From the diagram, we can see that OCR1
outperforms OCR2 on almost all pages.

[Two scatter plot panels, each titled “OCR 1 vs. OCR 2 Accuracy Scatter Plot”]

Figure 3.12 Accuracy scatter plot for two OCR with synthetic images

Figure 3.13 shows the overall performance for physically scanned images at
different resolutions. Calculating the accuracy rate over all pages gives us a
more accurate and more comprehensive analysis of the OCR software.


[Chart titled “OCR 1 vs. OCR 2 in Scanned Images”: series OCR 1 and OCR 2 at
200 dpi, 300 dpi, and 400 dpi]

Figure 3.13 Overall performances diagram for two OCR with scanned images

Figure 3.14 shows sorted OCR accuracies. We sorted the accuracy rates for
each OCR run and displayed them in this graph. From this figure, we can easily see
the highest accuracy rate, the lowest accuracy rate, and the performance pattern
for each run.

[Chart titled “Sorted OCR Accuracies, OCR 1 vs. OCR 2”]

Figure 3.14 Sorted accuracy rate in different runs

Figure 3.15 is the accuracy scatter plot for the synthetically degraded images
at different noise levels. Two levels of DDM [6] degraded images were used here to
test our OCR systems. The high noise level images are generated by using the
parameters (c0 = 0, α0 = 1.0, α = 0.5, β0 = 1.0, β = 2.5, γ = 3), and the lower
noise level images are generated by using (c0 = 0, α0 = 1.0, α = 0.5, β0 = 1.0,
β = 2.5, γ = 2.0). From this diagram, we can see that the performance drops as the
noise level increases.

Tables 2 and 3 are the percentage accuracy summaries (accuracy and
confidence intervals) for ideal and physically scanned images.


             OCR1                         OCR2
             100dpi         300dpi        100dpi         300dpi
ACC Rate     86.89%         99.82%        65.54%         88.95%
ACC Stat     86.52~87.27    99.78~99.85   63.09~67.97    88.23~89.68

Table 2 OCR Performance for synthetic images

             OCR1                                        OCR2
             200dpi         300dpi        400dpi         200dpi         300dpi        400dpi
ACC Rate     95.64%         96.25%        96.43%         87.89%         90.55%        91.05%
ACC Stat     94.65~96.67    95.46~97.07   95.64~97.26    86.23~89.58    88.83~92.33   89.30~92.85

Table 3 OCR Performance for scanned images with different resolutions

[Scatter plot titled “Accuracy Scatter Plot”, with one axis labeled “Low_Noise”]

Figure 3.15 Accuracy scatter plot for one OCR with two noise level images

Chapter 4 Ground truth alignment


Degraded images paired with their original electronic text content are often a
suitable test bed for OCR, since many evaluation methodologies work by aligning
entire passages of text. In some cases, however, character, word, and line
locations are still necessary. For example, the coordinate information is helpful
in evaluating the segmentation and layout analysis capability of an OCR system,
while the font and character size information may be useful in training an OCR
classifier. Unfortunately, after the documents are printed, copied, faxed, or
rescanned, the original ground truth locations are typically no longer aligned
with the degraded images. In this chapter, we discuss a methodology to align
noise-free images with the degraded images in order to obtain ground truth files
for those degraded images.

In  [5],  the  author  modeled  the  geometric  transformation  of  the  scanned  images 
via  a  linear  transformation  matrix.  This  linear  transformation  assumption  includes 
rotating,  scaling,  shearing,  and  translating.  Using  this  model,  we  have  obtained 
reasonable  results  in  aligning  images  taken  from  a  digital  camera,  or  from  a  print-fax- 
scan  procedure.  If  there  is  a  nonlinear  factor  in  the  degradation  procedure,  our  system 
can  still  provide  a  coarse  bounding  box  for  further  local  adjustment. 

4.1 Alignment overview

Figure 4.1 shows the alignment procedure.

Figure 4.1 Noise free image and degraded image alignment procedure

The following is the detailed procedure to get the linear transformation matrix.

1. Obtain ideal images with ground truth from our GTG. Instead of using the
four outermost points of all the bounding boxes of connected components on
the images, as in [5], we put four 14-point disks at the “Header and Footer”
position of each page. Disks are used in our experiment for two reasons.
First, with a disk we can easily find the geometric center from the
coordinates of its bounding box, even if it is degraded to an ellipse. Second,
an ellipse is much easier to detect than a cross using connected component
methods.

2.  Obtain  the  degraded  images  by  printing  the  ideal  images,  copying  them,  and 
faxing  them,  for  example.  Figure  4.2  shows  the  example  of  a  noise  free  image 
and  the  corresponding  faxed  image. 

[Two page images showing the same English text passage: the noise-free image and
the corresponding faxed image]

Figure 4.2 Noise free image and faxed image example

3. Locate feature points in the ideal images and the corresponding degraded
images. The position of each feature point on the noise-free and degraded
images is detected from the pattern of disks. The following procedure is
used to locate those feature points.

• Connected components are calculated on the image. Those that are
too small, or that have a large ratio of length to width, are
discarded as noise.

• All remaining components are checked to find feature points. Here we
use the fact that, for an ellipse, the ratio of its area to the
product of its length and width is a constant proportional to π
(π/4 when length and width are measured along the full axes). If the
ratio is close to this constant, we label the connected component as
a feature point.

• The position of each feature point is determined by simply
computing the geometric center of the bounding box of the
detected disk or ellipse.

4.  Compute  the  mapping  matrix  by  the  feature  point  pairs.  The  projective 
transformation  matrix  is  calculated  by: 

\[
\begin{pmatrix} x \\ y \end{pmatrix}
= \frac{1}{w}
\begin{pmatrix} p_1 u + p_2 v + p_3 \\ p_4 u + p_5 v + p_6 \end{pmatrix},
\qquad
w = p_7 u + p_8 v + 1
\]

where (x, y) are the coordinates of the feature point on the ideal image, (u, v)
are the coordinates of the feature point on the degraded image, and the p_i are
the coefficients of the transformation matrix.

5.  We  can  align  the  ground  truth  for  noise-free  image  to  degraded  images  by  the 
calculated  transformation  matrix. 
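
Steps 3 to 5 can be sketched as follows. This is an illustrative pure-Python version under stated assumptions, not the implementation used in our system: the tolerance value, the function names, and the sample coordinates are hypothetical, and the eight coefficients are obtained by solving the linear system that follows from multiplying the projective equations through by w. The same solver works in either direction; passing the ideal-image points as the source maps ground truth boxes onto the degraded image.

```python
import math

def looks_like_disk(area, width, height, tolerance=0.08):
    """A filled ellipse covers about pi/4 of its bounding box."""
    return abs(area / (width * height) - math.pi / 4) <= tolerance

def box_center(left, top, right, bottom):
    return ((left + right) / 2.0, (top + bottom) / 2.0)

def solve_linear(a, b):
    """Gaussian elimination with partial pivoting for a small dense system."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x

def fit_projective(src_pts, dst_pts):
    """Solve p1..p8 so that src (u, v) maps to dst (x, y); needs four point pairs."""
    a, b = [], []
    for (u, v), (x, y) in zip(src_pts, dst_pts):
        a.append([u, v, 1, 0, 0, 0, -u * x, -v * x])
        b.append(x)
        a.append([0, 0, 0, u, v, 1, -u * y, -v * y])
        b.append(y)
    return solve_linear(a, b)

def apply_projective(p, u, v):
    w = p[6] * u + p[7] * v + 1.0
    return ((p[0] * u + p[1] * v + p[2]) / w, (p[3] * u + p[4] * v + p[5]) / w)

# Hypothetical disk centers on the ideal page and a synthetic "degraded" view of them.
true_p = [0.9, 0.05, 12.0, -0.03, 0.95, 20.0, 1e-4, 2e-4]
ideal = [(0.0, 0.0), (100.0, 0.0), (100.0, 140.0), (0.0, 140.0)]
degraded = [apply_projective(true_p, u, v) for u, v in ideal]
fitted = fit_projective(ideal, degraded)
```

Each ground truth bounding box can then be carried over by transforming its four corners with `apply_projective` and taking the bounding box of the result.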


4.2  Alignment  experiments 

To test our alignment methodology, we printed the ideal images and
scanned them in four combinations of non-skewed/skewed and shrunk/enlarged.
Figure 4.3 shows an example of a skewed and shrunk English document, and a skewed
and enlarged Chinese document.

Figure 4.3 Example of skewed and shrunk/enlarged documents

Figure 4.4 shows an example of a faxed English document with the computed
bounding boxes displayed on it.

Figure 4.4 Example of aligned English document

Figure 4.5 shows an example of a faxed Chinese document with the computed
bounding boxes displayed on it.

Figure 4.5 Example of aligned Chinese document

We also tested our alignment system with document images obtained from a
digital camera with perspective distortion. Figures 4.6 and 4.7 show a document
image from the camera and the computed bounding boxes, respectively.

Figure 4.6 Document image with complex content from camera

Figure 4.7 Example of aligned camera image

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Chapter  5  Document  image  degradation 


In this chapter, we present a method to generate synthetic noisy images at
both the page and pixel levels. The main components of our system and the
types of noise are explained in detail in the following sections.

5.1 Image degradation architecture

The  architecture  of  our  system  is  shown  in  Figure  5.1. 


Figure 5.1 Image Degradation and Application Architecture

Beginning with electronic text, the ground truth generator produces
noise-free images and corresponding ground truth files by using a custom
printer driver and metafile information. Degraded images can then be
obtained physically by printing, copying, faxing, and scanning, or
synthetically by using degradation methods.

To make our degradation method effective while keeping it as simple as
possible, we chose the types of noise of greatest interest. Some of them
have been presented in [17]. For page level noise, we have rotation, blur,
lines, resolution change, and additive noise templates. For pixel level
degradation, we add speckles, jitter, and pixel drift to the document. We
also study show-through and bleed-through in this chapter.


5.2  Page  level  noise 


Skew, or rotation, is common in scanned or faxed documents. We have
implemented two methods for skew. The first takes the rotation angle as
input, rotates the given image about its center, and resizes the image if
necessary. Alternatively, the user can choose both the pivot and the
rotation angle.

In general, we can express the spatial transformation, including rotation
and skew, using a polynomial function:

$$
x = \sum_{i=0}^{N} \sum_{j=0}^{N} m_{ij}\, u^i v^j, \qquad
y = \sum_{i=0}^{N} \sum_{j=0}^{N} n_{ij}\, u^i v^j
$$

where (x, y) and (u, v) are coordinates in the input and output images,
respectively; N is the polynomial order; and $m_{ij}$, $n_{ij}$ are
coefficients, which can be computed from the registered point pairs of the
input and output images. When N = 1, it reduces to the bilinear form:

$$
\begin{aligned}
x &= m_{00} + m_{01} v + m_{10} u + m_{11} u v \\
y &= n_{00} + n_{01} v + n_{10} u + n_{11} u v
\end{aligned}
$$

In  our  method,  we  use  the  nearest-neighbor  interpolation  because  it  is  faster 
and  accurate  enough  for  binary  document  images. 
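As an illustration of rotation with nearest-neighbor interpolation, the following Python sketch rotates an image about its center without changing its size. It is a minimal sketch rather than the implementation used here: the list-of-rows image representation, the function name, and the background fill value are assumptions. Each output pixel (u, v) is mapped back to an input position (x, y), which is then rounded to the nearest input pixel.

```python
import math


def rotate_nearest(image, angle_deg, background=0):
    """Rotate a 2D image (list of rows) about its center, keeping its size.

    Uses inverse mapping: every output pixel (u, v) looks up the nearest
    input pixel (x, y). Pixels that map outside the input get `background`.
    """
    height, width = len(image), len(image[0])
    cx, cy = (width - 1) / 2.0, (height - 1) / 2.0
    angle = math.radians(angle_deg)
    cos_a, sin_a = math.cos(angle), math.sin(angle)
    out = [[background] * width for _ in range(height)]
    for v in range(height):
        for u in range(width):
            du, dv = u - cx, v - cy
            # Inverse rotation from output coordinates to input coordinates.
            x = cos_a * du + sin_a * dv + cx
            y = -sin_a * du + cos_a * dv + cy
            xi, yi = int(round(x)), int(round(y))  # nearest neighbor
            if 0 <= xi < width and 0 <= yi < height:
                out[v][u] = image[yi][xi]
    return out
```

For binary images the nearest-neighbor lookup never creates intermediate gray values, which is one reason it is a convenient choice in this setting.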

Figure 5.2 shows the effect of the two kinds of rotation. The left image
has been rotated 45° about the center of the image and resized to
accommodate the change. The right image has been rotated 10° without
changing the size.

Figure 5.2 Example of rotation of 45° and rotation of 10°

Blur is another typical artifact found in degraded images, and is often
caused by a point spread effect in printing and/or scanning. This type of
noise, along with thresholding, significantly affects the recognition
accuracy of almost any classifier [17]. We model blur by convolving the
image with a Gaussian low pass filter. With our degradation method, the
user can specify the Gaussian function's standard deviation σ (in pixels),
the size of the spatial smoothing mask, the convolution probability for
each pixel, and the threshold value. As shown in Figure 5.3, the 2D
Gaussian function is sampled at equal intervals and normalized to obtain
the filter. Because the Gaussian distribution is non-zero everywhere, we
truncate the kernel at three standard deviations from the mean. For
example, if σ = 1.0 and the mask size is 7, a 7x7 matrix is generated. The
element M(u, v) of this mask before normalization is calculated by:

$$
M(u, v) = \frac{1}{2\pi\sigma^2} \exp\left(-\frac{u^2 + v^2}{2\sigma^2}\right)
$$

Figure 5.3 2D Gaussian function and sampling grid

This 2D convolution is quite slow when processing large images with large
masks. However, because the Gaussian filter is separable, the speed can be
increased by convolving the image with two 1D Gaussian filters, one in the
X direction and one in the Y direction.

For  each  foreground  pixel,  the  convolution  probability  will  be  compared  with 
a  uniformly  generated  random  number  to  determine  if  the  convolution  mask  should 
be  applied  on  this  pixel.  If  it  should,  after  the  convolution,  the  threshold  will  be  used 
to  decide  if  the  underlying  pixel  will  be  set  as  foreground  or  as  background.  One 
example  of  an  image  before  and  after  blurring  is  shown  in  Figure  5.4. 
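The blur procedure described above can be sketched in Python as follows. This is a hedged illustration, not the DLDegradation code: the function names are hypothetical, the image is assumed to be a list of rows with 1 for foreground and 0 for background, the mask size is assumed to be odd, pixels outside the image are treated as background, and the threshold is assumed to be compared against the blurred ink value on a 0 to 255 scale. For clarity it uses the direct 2D mask rather than the faster separable form.

```python
import math
import random


def gaussian_mask(sigma, size):
    """Sample the 2D Gaussian on a size x size grid (size odd) and normalize."""
    half = size // 2
    norm = 2.0 * math.pi * sigma * sigma
    mask = [[math.exp(-(u * u + v * v) / (2.0 * sigma * sigma)) / norm
             for u in range(-half, half + 1)]
            for v in range(-half, half + 1)]
    total = sum(sum(row) for row in mask)
    return [[m / total for m in row] for row in mask]


def blur_binary(image, sigma, size, probability, threshold, seed=None):
    """Blur foreground pixels with the given probability, then re-threshold."""
    rng = random.Random(seed)
    mask = gaussian_mask(sigma, size)
    half = size // 2
    height, width = len(image), len(image[0])
    out = [row[:] for row in image]
    for y in range(height):
        for x in range(width):
            # Only foreground pixels are candidates; skip with prob. 1 - p.
            if image[y][x] != 1 or rng.random() >= probability:
                continue
            acc = 0.0
            for dy in range(-half, half + 1):
                for dx in range(-half, half + 1):
                    yy, xx = y + dy, x + dx
                    if 0 <= yy < height and 0 <= xx < width:
                        acc += 255.0 * image[yy][xx] * mask[dy + half][dx + half]
            out[y][x] = 1 if acc >= threshold else 0
    return out
```

With these assumptions, an isolated foreground pixel blurred with σ = 1.0 falls well below a mid-range threshold and is removed, while pixels inside a solid region keep their value.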

Another important page level noise is random lines scattered horizontally
or vertically on the document image. This happens frequently in faxed or
copied documents when individual sensors are bad. The parameters of this
kind of noise in our model include the number of lines, and the minimum and
maximum of the length, width and density of the lines. The length and width
of each line are chosen randomly within the given ranges, and the line is
overlaid on the image. The density parameter controls the percentage of
black pixels in one line. In addition to adding lines to the image, the
user can also randomly remove lines from the image. The effects of adding a
horizontal line and removing a vertical line can be seen in Figures 5.5b
and 5.5c, respectively.
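A minimal Python sketch of the horizontal line noise is given below. The function name, the argument order, and the `value` switch (1 to add black lines, 0 to erase them) are assumptions for illustration only; the image is assumed to be a list of rows with 1 for black, and the ranges are assumed to fit inside the image.

```python
import random


def add_horizontal_lines(image, count, min_len, max_len, min_width, max_width,
                         min_density, max_density, value=1, seed=None):
    """Overlay `count` random horizontal lines on a copy of a binary image.

    value=1 draws black lines; value=0 erases pixels (line removal).
    """
    rng = random.Random(seed)
    height, width = len(image), len(image[0])
    out = [row[:] for row in image]
    for _ in range(count):
        length = rng.randint(min_len, min(max_len, width))
        thickness = rng.randint(min_width, min(max_width, height))
        density = rng.uniform(min_density, max_density)
        x0 = rng.randint(0, width - length)
        y0 = rng.randint(0, height - thickness)
        for y in range(y0, y0 + thickness):
            for x in range(x0, x0 + length):
                # Density is the fraction of pixels in the line that are set.
                if rng.random() < density:
                    out[y][x] = value
    return out
```

Vertical lines follow by swapping the roles of the two axes.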
Figure 5.4 Example of a document before and after blurring

Merge is used to combine the noise-free image with noise templates, which
are obtained from the background of physically scanned or copied images.
Using this method, we can create thousands of “scanned” or “copied” test
images at minimum cost. Although this method may not be ideal for OCR
training and testing, it works well for evaluating document image
segmentation and analysis, such as handwriting or logo detection
algorithms. Figure 5.6 shows the effect of “additive” merge: the original
image, the noise template, and the result of combining the noise template
with the given image.


Figure 5.5 Example of scattering lines on document image

Figure 5.6 Example of merge: (a) noise free image (b) noise template (c) degraded image


5.3  Pixel  level  noise 


In addition to the page level noise discussed in the last section, several
pixel level noise models are presented in this section.

Speckles are multiplicative noise, and can be expressed as:

$$
Y(u, v) = X(u, v)\,(1 + n)
$$

where X(u,v) is the original pixel intensity, Y(u,v) is the degraded value,
and n is a random number. In our model, we define speckles as randomly
generated patterns of different sizes (in pixels), drawn according to a
specified distribution. The effective sizes of the patterns can be chosen
by the user, and usually range from 1 to 10. The parameters of this method
include the frequency of each speckle pattern and the probability of
speckle generation, which controls the number of speckles on the image. The
frequencies of the speckle sizes are used to obtain the distribution of
speckles. We use the Cumulative Distribution Function (CDF) of the
distribution and a uniformly generated random number to generate speckles
distributed according to the given profile.

For example, if the given distribution for speckles with sizes 1 to 5 is:

Size          1      2      3      4      5
Probability   0.33   0.33   0.16   0.1    0.08

the corresponding CDF is:

Size          1      2      3      4      5
CDF           0.33   0.66   0.82   0.92   1.0

Every time we need to generate a speckle, we choose the size whose CDF
interval the random number falls into, i.e. the “ceiling” size. For
example, if the random number falls between 0.82 and 0.92, size 4 is
chosen. Because the random number is distributed uniformly, the probability
of choosing a given speckle size is determined only by the width of its CDF
interval. Thus the probability of “showing” a speckle of size 4 is 0.1 in
this example. An example of speckles in different patterns is shown in
Figure 5.7c.
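The size selection can be sketched in a few lines of Python. The function names are hypothetical, and the sketch only covers choosing the speckle size, not drawing the speckle pattern itself.

```python
import bisect
import itertools
import random


def build_cdf(probabilities):
    """Running sum of the size probabilities (size 1 first)."""
    return list(itertools.accumulate(probabilities))


def pick_size(cdf, r):
    """Return the smallest size whose CDF value is >= r (sizes start at 1)."""
    index = bisect.bisect_left(cdf, r)
    # Guard against rounding that leaves the last CDF entry just below 1.0.
    return min(index, len(cdf) - 1) + 1


def sample_sizes(probabilities, count, seed=None):
    """Draw `count` speckle sizes from the given size distribution."""
    rng = random.Random(seed)
    cdf = build_cdf(probabilities)
    return [pick_size(cdf, rng.random()) for _ in range(count)]
```

With the distribution from the example, `pick_size(build_cdf([0.33, 0.33, 0.16, 0.1, 0.08]), 0.85)` selects size 4, because 0.85 lies between the CDF values 0.82 and 0.92.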
Figure 5.7 Example of speckles and jitter: (a) degraded image (b) jitter effect (c) speckles

Jitter is one of the local noises introduced in our method to mimic the
effect of disturbance of the sampling grid during copying and scanning.
This white noise jitter samples uniformly in a 2D window of size r centered
at the given pixel, as given by the following function:

$$
Y(u, v) = X(u + r_1, v + r_2)
$$

where Y(u,v) is the new intensity value for the pixel at (u,v),
$X(u + r_1, v + r_2)$ is the intensity at $(u + r_1, v + r_2)$, and $r_1$
and $r_2$ are independent random variables uniformly distributed inside the
window of size r.

Jitter inverts pixels along edges with higher probability than those in
homogeneous regions. Figure 5.7b shows the effect of jitter with r = 2.
This effect is similar, but not identical, to the following degradation
method, pixel flipping.
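The jitter model can be sketched in Python as below. The exact extent of the window is not spelled out above, so this sketch assumes integer offsets drawn uniformly from -r to r in each direction and clamps coordinates at the image border; both choices, and the function name, are assumptions for illustration.

```python
import random


def jitter(image, r, seed=None):
    """Replace each pixel with a randomly chosen pixel from its neighborhood.

    Offsets r1 and r2 are independent and uniform on the integers -r..r.
    """
    rng = random.Random(seed)
    height, width = len(image), len(image[0])
    out = [[0] * width for _ in range(height)]
    for v in range(height):
        for u in range(width):
            r1 = rng.randint(-r, r)
            r2 = rng.randint(-r, r)
            # Clamp the sampling position so it stays inside the image.
            su = min(max(u + r1, 0), width - 1)
            sv = min(max(v + r2, 0), height - 1)
            out[v][u] = image[sv][su]
    return out
```

In a homogeneous region every candidate neighbor has the same value, so only pixels within r of an edge can change, which matches the behavior described above.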

Based on the DDM proposed in [5], pixel flipping can invert a given pixel
from black to white or vice versa. This method has seven parameters; the
first one is the seed used for the random number generator. The next five
parameters are used in the following functions:

$$
P(x = 1 \mid x = 0, d_1) = P_0 + A_0 \exp(-B_0 d_1^2)
$$

$$
P(x = 0 \mid x = 1, d_2) = P_0 + A_1 \exp(-B_1 d_2^2)
$$

where $P(x = 1 \mid x = 0, d_1)$ is the probability of a given background
pixel being inverted to foreground, and $d_1$ is the distance from the
background pixel to the foreground; $P(x = 0 \mid x = 1, d_2)$ is the
probability of a given foreground pixel being inverted to background, and
$d_2$ is the distance from the foreground pixel to the background. The
effect of pixel drift can be seen in Figure 5.8.
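The two flipping probabilities share one functional form, which the following Python sketch evaluates. The function names are hypothetical, the distance to the opposite class is assumed to be supplied by the caller (for example from a distance transform), and the result is capped at 1 so that it is always a valid probability.

```python
import math
import random


def flip_probability(distance, p0, a, b):
    """P0 + A * exp(-B * d^2), capped at 1.0."""
    return min(1.0, p0 + a * math.exp(-b * distance * distance))


def maybe_flip(pixel, distance, p0, a0, b0, a1, b1, rng):
    """Flip a binary pixel (0 = background, 1 = foreground) at random.

    Background pixels use (A0, B0); foreground pixels use (A1, B1).
    """
    if pixel == 0:
        p = flip_probability(distance, p0, a0, b0)
    else:
        p = flip_probability(distance, p0, a1, b1)
    return 1 - pixel if rng.random() < p else pixel
```

The exponential term makes pixels next to an edge far more likely to flip than pixels deep inside a stroke or deep in the background, where the probability falls back to the constant P0.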


Figure 5.8 The effect of pixel level degradation: (a) Original image; (b) Degraded at low
noise level; (c) Degraded at high noise level.

5.4 Bleed-through and show-through noise

5.4.1  Background 

Bleed-through is often found in ancient manuscripts and is caused in part
by ink seeping through from the reverse side of the manuscript.
Show-through appears in scanned images of double-sided printed documents
when the paper is not completely opaque, and the light in the scanner is
allowed to reflect back through the document. Figure 5.9 is an example of a
scanned newspaper.


Figure 5.9 The show-through effect of a scanned newspaper

Although this type of artifact is often encountered in document imaging,
existing degradation models such as [15] and [18] have not explicitly
investigated this type of noise. In [31], several synthetic show-through or
bleed-through images were generated to test noise cancellation and recovery
algorithms. However, to make the data model a noiseless mixture suitable
for Independent Component Analysis (ICA) methods, the authors assumed no
noise or blur, which made the generated images unrealistic. In [32], the
author analyzes the show-through phenomenon from physical first principles,
and models it with a linear function of reflectance and transmittance.
Because the focus of [32] was on removing show-through noise, the author
uses this model to design a linear filtering scheme rather than to generate
synthetic images.

The need for such a model and the work described above inspired us to
incorporate the show-through effect into our degradation model, which is
described in detail in the following section.

5.4.2 Approach

To obtain an image with show-through effect from a given front side image
and a back side image, we reverse the back side image left-to-right, and
then blur it with a Gaussian low pass filter. We then combine the
preprocessed back side image with the front side image to generate the
synthetic image.

The case where either the front side or the back side image is binary is
relatively easy. We threshold the other image to binary if it is grayscale,
and apply the logical “OR” operator to the two binary images. An example of
combining a grayscale back side image with a binary front side image is
shown in Figure 5.10.
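The binary case can be sketched in Python as follows. The sketch makes several assumptions that are not stated above: images are lists of rows, grayscale values run from 0 (black) to 255 (white), binary images use 1 for ink, both images have the same size, and the Gaussian blur of the back side is left out to keep the example short. The function names are hypothetical.

```python
def mirror(image):
    """Reverse an image left-to-right."""
    return [row[::-1] for row in image]


def to_binary(gray, threshold):
    """Mark pixels darker than the threshold as ink (1), others as 0."""
    return [[1 if value < threshold else 0 for value in row] for row in gray]


def show_through_binary(front_binary, back_gray, threshold):
    """OR a binary front side with the mirrored, thresholded back side."""
    back_binary = to_binary(mirror(back_gray), threshold)
    return [[f | b for f, b in zip(front_row, back_row)]
            for front_row, back_row in zip(front_binary, back_binary)]
```

Ink printed at the left edge of the back side appears at the right edge of the result, as it would when looking through the sheet from the front.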
Figure 5.10 Example of binary front side image; (c) Synthetic show-through image
The more interesting case, where both front and back side images are
grayscale, uses the following equation to compute the new
bleed-through/show-through image:

$$
g(\cdot) = \Theta\big(f(\cdot),\; H \otimes R(b(\cdot))\big) + N
$$

where f(·) is the front side image, b(·) is the back side image, R(·) is
the reverse function, H is the “blurring matrix” corresponding to a
shift-invariant point-spread function, N is noise, assumed here to be
additive, independent and identically distributed, g(·) is the generated
bleed-through/show-through image, and Θ(·,·) is the transformation
function. In our preliminary experiments, we choose Θ(·,·) as a linear
function to simplify the model. If the front side image and back side image
are not the same size, we resize the back side image to the size of the
front side image in the preprocessing stage. Then for each pixel (i, j):

$$
g_{i,j} =
\begin{cases}
f_{i,j} - \alpha\,(f_{i,j} - B_{i,j}) + N_{i,j} & \text{if } f_{i,j} - B_{i,j} > I_{threshold} \\
f_{i,j} + N_{i,j} & \text{otherwise}
\end{cases}
$$

where:

B: H ⊗ R(b(·)), i.e. the blurred and reversed back side image

α and I_threshold: parameters to control the attenuation rate.

This model is based on two observations. The first is that a show-through
effect appears only when the intensity difference between the front and
back side images is large enough. In other words, if the front side pixel
is too dark, or the back side pixel is too light, the intensity of the
original front side pixel becomes dominant in the new pixel. The second
observation is that show-through can only make the new pixel darker than
the front side pixel. To justify our method, an example is given in the
following section.
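One reading of the grayscale rule, consistent with the two observations above, is sketched below in Python: where the front pixel is lighter than the preprocessed back pixel by more than the threshold, the front intensity is pulled toward the back intensity by a fraction α; elsewhere the front pixel is kept. This is an interpretation for illustration, not the reference implementation. It assumes intensities from 0 (black) to 255 (white), a back side image that has already been reversed and blurred, equal image sizes, and no additive noise; the function name is hypothetical.

```python
def show_through_gray(front, back_preprocessed, alpha, threshold):
    """Blend a grayscale front side with a preprocessed back side image.

    A pixel is darkened only when front - back exceeds the threshold;
    otherwise the front pixel is copied unchanged.
    """
    result = []
    for front_row, back_row in zip(front, back_preprocessed):
        row = []
        for f, b in zip(front_row, back_row):
            if f - b > threshold:
                row.append(f - alpha * (f - b))
            else:
                row.append(f)
        result.append(row)
    return result
```

For example, with α = 0.1 and a threshold of 60, a front pixel of 200 over a back pixel of 100 becomes 190, while a front pixel of 100 over a back pixel of 80 is left at 100.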

5.4.3  Experiments 

We generated a synthetic image with the proposed approach, and compared it
visually with a real scanned newspaper image, such as Figure 5.9, to
demonstrate its effectiveness. Our original plan was to obtain two blank
newspaper pages of the same size and quality, print some text on the first
sheet, and scan it to obtain the front side image; obtain the back side
image in the same way with the second sheet; and finally print the back
side image on the back of the first sheet and scan it to obtain the real
image with show-through effect. Because it is hard to obtain blank
newsprint and a machine similar to those used for printing newspapers, we
instead extracted the front side image and back side image directly from
Figure 5.9.

To obtain a “pure” back side image of the newspaper, we place a sheet of
black paper between the newspaper and the scanner backing. This limits the
undesirable scan-through effect but does not remove it completely. However,
this is acceptable because the intensity of the back side pixels attenuates
significantly in generating the show-through image, so the faint
show-through effect on the back side image decays much faster. It is hard
to obtain a “pure” front side image in this way because the intensity of
the front side pixels is usually dominant in generating the scan-through
image, which is thus sensitive to the “seeping” ink.

Fortunately, we can separate the scanned image into three layers:
foreground, which is the printed text or figures; background, which
contains mostly the texture of the newspaper; and the unwanted reverse side
layer. With thresholds calculated from the newspaper image, we cluster the
pixels into three layers, and remove those labeled as the unwanted layer.
To make the background smoother, we fill in the positions of the removed
pixels with the nearest background pixel intensity. Figure 5.11 shows the
front side image and back side image, respectively. There is now almost no
show-through effect in the front side image.

Figure 5.11 (a) Front side image; (b) Back side image from Figure 5.9

Figure 5.12 compares the synthetically generated image with the real scanned
image. Visually, they are similar. Because we “filtered” out some of the noise and
stains when obtaining the front side image, the synthetic image appears slightly
lighter and has a more homogeneous background. The result could therefore be
improved if the exact front side image were available. The parameter a and the
threshold are currently chosen manually, but they could be computed by analyzing
the changes in the front side and back side image pixels.

Figure 5.12 (a) Synthetically generated scan-through image with a = 0.1 and
threshold = 60; (b) real scanned newspaper

5.5  Implementation 

All the degradation methods have been implemented in C++ and encapsulated in the
DLDegradation class, which can be incorporated into other systems and used as an
API. The DLDegradation class uses the basic image data format defined in the
DocLib library. Table 4 lists the functions and their parameters:


Noise Type     Parameter                  Type    Min           Max
-------------  -------------------------  ------  ------------  ----------------
Blur           Filter Size                Int     3             11
               Std Deviation              Float   0             10.0
               Passes Number              Int     1             10
               Blur Probability           Float   0             1.0
               Threshold                  Int     0             255
Resolution     Horizontal Res.            Int     50            Image Res.
               Vertical Res.              Int     50            Image Res.
               Threshold                  Float   0.0           100.0
HLINE          Min. Length                Int     1             Image width - 1
               Max. Length                Int     Min. Length   Image width
               Min. Width                 Int     1 (-1) ¹      20 (-20) ¹
               Max. Width                 Int     Min. Width    20 (-20) ¹
               Min. Density               Float   0.0           1.0
               Max. Density               Float   Min. Density  1.0
               Number of Lines            Int     1             100
VLINE          Min. Length                Int     1             Image height - 1
               Max. Length                Int     Min. Length   Image height
               Min. Width                 Int     1 (-1) ¹      20 (-20) ¹
               Max. Width                 Int     Min. Width    20 (-20) ¹
               Min. Density               Float   0.0           1.0
               Max. Density               Float   Min. Density  1.0
               Number of Lines            Int     1             100
Rotation       Rotate Point X             Int     N/A           N/A
               Rotate Point Y             Int     N/A           N/A
               Rotate Angle               Float   -180.0        180.0
Speckles       Dist. Values for 10 Sizes  Float   0.0           100.0
               Probability                Float   0.0           1.0
Pixelflip [5]  R.V. Seed                  Int     0             N/A
               P0                         Float   0.0           1.0
               A0                         Float
               B0                         Float
               A1                         Float
               B1                         Float
               D                          Int     2
Merge          Image file                 String
Show-through   Image file                 String
               a                          Float   0.0           1.0
               Threshold                  Int     1             255

Table 4 Function and parameter list for the document degradation class
Remark 1: When the width is specified as a negative value, this function removes
lines from the image.
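
The ranges in Table 4 can be checked before a degradation is applied. The sketch
below is a hedged illustration only: the struct and function names are
hypothetical and are not the DLDegradation interface, which is not shown here. It
covers the Blur rows and the HLINE/VLINE rows, and it assumes that a negative
width pair (line removal, Remark 1) must have the same sign for both the minimum
and the maximum width, with the limits applied to the magnitudes.

```cpp
#include <cassert>
#include <cstdlib>

// Hypothetical parameter bundles mirroring the Blur and HLINE/VLINE rows.
struct BlurParams {
    int filterSize;
    float stdDeviation;
    int passes;
    float probability;
    int threshold;
};

struct LineParams {
    int minLength;
    int maxLength;
    int minWidth;   // negative values request line removal (Remark 1)
    int maxWidth;
    float minDensity;
    float maxDensity;
    int count;
};

inline bool inRange(double v, double lo, double hi) {
    return v >= lo && v <= hi;
}

// Checks the Blur limits listed in Table 4.
bool isValid(const BlurParams& p) {
    return inRange(p.filterSize, 3, 11) &&
           inRange(p.stdDeviation, 0.0, 10.0) &&
           inRange(p.passes, 1, 10) &&
           inRange(p.probability, 0.0, 1.0) &&
           inRange(p.threshold, 0, 255);
}

// Checks the line limits; lengthLimit is the image width for HLINE and the
// image height for VLINE.
bool isValid(const LineParams& p, int lengthLimit) {
    // Assumption: both widths carry the same sign.
    const bool sameSign = (p.minWidth > 0) == (p.maxWidth > 0);
    const int wMin = std::abs(p.minWidth);
    const int wMax = std::abs(p.maxWidth);
    return inRange(p.minLength, 1, lengthLimit - 1) &&
           inRange(p.maxLength, p.minLength, lengthLimit) &&
           sameSign &&
           inRange(wMin, 1, 20) &&
           inRange(wMax, wMin, 20) &&
           inRange(p.minDensity, 0.0, 1.0) &&
           inRange(p.maxDensity, p.minDensity, 1.0) &&
           inRange(p.count, 1, 100);
}

// True when the parameters ask for lines to be removed rather than added.
bool removesLines(const LineParams& p) {
    return p.minWidth < 0;
}
```

For example, a LineParams value with widths -1 and -3 passes the check for a
640-pixel-wide image and is reported as a removal request, while a pair with
mixed signs is rejected.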

Chapter 6 Summary


Ground truth collection plays an important role in training and evaluating
document analysis systems. Manually generating data sets is labor-intensive and
error-prone. Furthermore, it is prohibitively expensive to obtain representative
multilingual data sets with thousands of pages. Using existing data sets may
partly alleviate the cost, but they are not flexible enough for evaluating a
system that requires specific vocabulary or special document styles. With the
increased interest in processing multilingual sources, there is a tremendous need
to be able to rapidly generate data in new languages and scripts without
developing specialized systems.

6.1  Summary  of  contributions 

The  main  contributions  presented  in  this  thesis  are: 

1. We have proposed and implemented a methodology to automatically generate
ground truth from electronic text. This method produces the complete ground
truth, including symbolic text files and noise-free images at different
resolutions, which can be used in training or evaluating document analysis
systems. It is extremely flexible and convenient when dealing with new
languages and scripts. For most of the languages of interest, the electronic
text can be copied and pasted from websites, so no manual input processing is
needed.

2. We have proposed a method to transform ground truth files from ideal images
to degraded images. This method models the transformation as a linear
projective transformation. Because we use four feature points to position the
bounding boxes of the document, our method is robust to noise in the documents.

3.  We  have  integrated  a  multi-lingual  OCR  evaluation  system,  and  have 
evaluated  two  Chinese  OCR  systems. 

4. We have proposed and implemented a document image degradation
methodology. This method incorporates page level and pixel level noise often
encountered in printing, copying, faxing, and scanning. Rotation, blur,
scattered lines, resolution change, noise template merge, speckles, jitter,
pixel drift, and show-through are included in this method.
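
The projective mapping of ground truth in contribution 2 can be illustrated with
a short sketch. It is a minimal example under stated assumptions, not the thesis
code: the Point, Box, and Homography types and the function names are
hypothetical, the 3x3 matrix is estimated from exactly four point correspondences
by Gauss-Jordan elimination with its last entry fixed to 1, and a transformed
box is reported as the axis-aligned bounds of its four mapped corners.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cmath>
#include <utility>

struct Point { double x, y; };
struct Box { double left, top, right, bottom; };
using Homography = std::array<double, 9>;  // row-major 3x3, h[8] fixed to 1

// Estimates H so that H * src[i] is proportional to dst[i] for four point
// pairs. Returns false when the configuration is degenerate.
bool estimateHomography(const std::array<Point, 4>& src,
                        const std::array<Point, 4>& dst, Homography& h) {
    double a[8][9] = {};
    for (int i = 0; i < 4; ++i) {
        const double x = src[i].x, y = src[i].y;
        const double u = dst[i].x, v = dst[i].y;
        const double r0[9] = {x, y, 1, 0, 0, 0, -u * x, -u * y, u};
        const double r1[9] = {0, 0, 0, x, y, 1, -v * x, -v * y, v};
        for (int j = 0; j < 9; ++j) {
            a[2 * i][j] = r0[j];
            a[2 * i + 1][j] = r1[j];
        }
    }
    for (int col = 0; col < 8; ++col) {
        int pivot = col;  // partial pivoting for numerical stability
        for (int r = col + 1; r < 8; ++r) {
            if (std::fabs(a[r][col]) > std::fabs(a[pivot][col])) pivot = r;
        }
        if (std::fabs(a[pivot][col]) < 1e-12) return false;
        for (int j = 0; j < 9; ++j) std::swap(a[col][j], a[pivot][j]);
        for (int r = 0; r < 8; ++r) {
            if (r == col) continue;
            const double f = a[r][col] / a[col][col];
            for (int j = col; j < 9; ++j) a[r][j] -= f * a[col][j];
        }
    }
    for (int i = 0; i < 8; ++i) h[i] = a[i][8] / a[i][i];
    h[8] = 1.0;
    return true;
}

// Applies the projective transformation to one point.
Point apply(const Homography& h, Point p) {
    const double w = h[6] * p.x + h[7] * p.y + h[8];
    return {(h[0] * p.x + h[1] * p.y + h[2]) / w,
            (h[3] * p.x + h[4] * p.y + h[5]) / w};
}

// Maps a ground truth box from the ideal image into the degraded image.
Box transformBox(const Homography& h, const Box& b) {
    const Point c[4] = {apply(h, {b.left, b.top}), apply(h, {b.right, b.top}),
                        apply(h, {b.right, b.bottom}), apply(h, {b.left, b.bottom})};
    return {std::min({c[0].x, c[1].x, c[2].x, c[3].x}),
            std::min({c[0].y, c[1].y, c[2].y, c[3].y}),
            std::max({c[0].x, c[1].x, c[2].x, c[3].x}),
            std::max({c[0].y, c[1].y, c[2].y, c[3].y})};
}
```

With the four corners of a 100 x 100 square mapped to a square twice as large and
shifted by (10, 20), the estimated transformation sends the box (10, 10, 20, 30)
to (30, 40, 50, 80).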